[Wasm RyuJIT] Initial writeup on the calling convention #122988
Describe the calling convention used by R2R (and perhaps someday jitted) Wasm code.
|
PTAL @dotnet/wasm-contrib... this is a first draft so please comment / help fix. |
Pull request overview
This PR adds comprehensive documentation for the WebAssembly calling convention used by R2R (Ready-to-Run) and JIT-compiled managed code in the CLR. The documentation describes how the runtime interoperates with WebAssembly's stack model, calling sequences, and garbage collection integration.
Key Changes:
- Adds a new "Web Assembly ABI (R2R and JIT)" section to the CLR ABI documentation
- Documents stack layout, argument passing conventions, prolog/epilog behavior, and calling sequences
- Explains GC reference handling at call sites and the portable entry point mechanism
Co-authored-by: Copilot <[email protected]>
|
This is probably a good time to raise that I think we shouldn't pass the stack pointer in an argument.
Reasons against it:
Given that we're not linking, I view the sp argument as a potential optimization, and I feel like it's premature. Single or Yowl may still have a good reason to do it, based on their experience.

I believe what we would do instead is export the stack_pointer from the runtime's wasm module (this may already be the default behavior for emscripten, to enable dynamic linking?), then grab the WebAssembly.Global for the stack_pointer and import it into every r2r module we load. Then all our code can manipulate the linear stack the same way clang-generated code does. |
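To make the alternative concrete, here is a minimal C model of the "shared global SP" convention described above. It is only a sketch: `linear_memory`, `stack_pointer`, `frame_enter`, `frame_leave`, and `callee` are hypothetical names invented for illustration, the 64 KiB stack and 16-byte alignment are assumptions, and a plain C variable stands in for the imported wasm global.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for this model only: a 64 KiB linear stack that grows
   downward, with 16-byte frame alignment. */
enum { STACK_SIZE = 64 * 1024, FRAME_ALIGN = 16 };

static uint8_t linear_memory[STACK_SIZE];

/* Stands in for the stack_pointer global that the runtime module would
   export and every R2R module would import (hypothetical name). */
static uint32_t stack_pointer = STACK_SIZE;

/* Prolog: reserve an aligned frame by moving the global SP down. */
static uint32_t frame_enter(uint32_t size)
{
    uint32_t aligned = (size + FRAME_ALIGN - 1u) & ~(uint32_t)(FRAME_ALIGN - 1);
    assert(aligned <= stack_pointer);  /* the model has no other overflow check */
    stack_pointer -= aligned;
    return stack_pointer;              /* base address of the new frame */
}

/* Epilog: release the frame by restoring the caller's SP. */
static void frame_leave(uint32_t caller_sp)
{
    stack_pointer = caller_sp;
}

/* A callee with a 24-byte frame. Note that its signature carries no SP
   parameter; the frame is found through the global alone. */
static uint32_t callee(uint32_t value)
{
    uint32_t caller_sp = stack_pointer;
    uint32_t frame = frame_enter(24);
    linear_memory[frame] = (uint8_t)value;        /* spill to the linear stack */
    uint32_t result = linear_memory[frame] + 1u;  /* reload and use it */
    frame_leave(caller_sp);
    return result;
}
```

In this model every non-leaf method pays for a read and a write of the global in its prolog and another write in each epilog, which is the per-method cost the rest of the thread tries to quantify.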
Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument. |
It doesn't seem necessary to optimize for the stub case? Since stubs will be a very small proportion of the overall code, and every managed callsite will need to be made at least 2 bytes bigger (for |
|
How will a |
Aren't globals functionally thread-local in wasm? Is it different for wasi? |
I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls. Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it (

Edit: found some discussion about this: https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one.

Another edit: more here: https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2024/THREADS-07-09.md. |
This also appears to mention that importing a global turns it into two loads instead of one, which is a bit of a problem. But I think we have to import sp no matter what and it's just a question of how often we're going to touch it. |
|
Tagging subscribers to 'arch-wasm': @lewing, @pavelsavara |
|
Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

Global SP always in sync

So 10/18 bytes per prolog, 7/11 bytes per epilog. No overhead at call sites. Smaller signatures.

Global SP lazy sync at boundaries

For the current x86 crossgen SPMI collection (which may not represent the set of methods we care about) there are 273829 methods / prologs, 312716 epilogs, 1504289 managed call sites, and 163641 helper call sites. So assuming the optimistic case where we can encode the global SP index in one byte and wrap FCalls with a global SP update (size assumed negligible) I get size estimates like:

If we can't get a small index for the SP global then the "sync" cost rises to 8.3M (~30 bytes/method).

This may overstate the difference somewhat. For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet. |
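The "always in sync" figure can be reproduced from the counts quoted above. The sketch below assumes that the "10/18" and "7/11" pairs are the small-index and large-index encodings respectively, which is a reading of the comment rather than something it states outright; the function names are hypothetical.

```c
#include <stdint.h>

/* Counts quoted from the x86 crossgen SPMI collection in the comment above. */
enum { PROLOGS = 273829, EPILOGS = 312716 };

/* Total bytes spent keeping the global SP in sync, given the per-prolog
   and per-epilog sequence sizes in bytes. */
static uint64_t sync_cost_bytes(uint32_t prolog_bytes, uint32_t epilog_bytes)
{
    return (uint64_t)PROLOGS * prolog_bytes + (uint64_t)EPILOGS * epilog_bytes;
}

/* Average sync overhead per method, in bytes. */
static double bytes_per_method(uint64_t total_bytes)
{
    return (double)total_bytes / PROLOGS;
}
```

Under that reading, `sync_cost_bytes(10, 7)` gives 4,927,302 bytes and `sync_cost_bytes(18, 11)` gives 8,368,798 bytes, or about 30.6 bytes per method. That is roughly in line with the "8.3M (~30 bytes/method)" quoted for the large-index case.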
This matches my calculations based on NAOT-LLVM data from last year almost perfectly: "50%" code size for the large index encoding, about the same for the small encoding.
The proportion of truly leaf methods is going to be rather low. It was on the order of < 5% on real-world NAOT-LLVM data (this is from dotnet/runtimelab#2697). This is because most methods may throw (an NRE), requiring a helper call. |
Was this with assumption that |
Yes. I can re-measure how this works out with the non-null-this assumption. Though I personally wouldn't support such a thing "for WASM only". And it would be a breaking change for structs ( |
|
I believe we have some non-deterministic behaviors around null this pointer for reference types today. I would not feel bad about more UB there by default. It is impossible to end up with null this for reference types in C#.
I guess we can keep them for structs. |
I'll write down my thinking on this question. I'll use

Code Size

As per Andy's study above (and my earlier investigation as well), the size impact is "weakly in favor" of using

"Weakly in favor" because:

Throughput

This point is clearly in favor of

Runtime complexity & interop

This point is clearly in favor of
Based on the above I would personally be weakly in favor of |
|
From what I can tell there is no way to have LLVM inline assembly insert instructions before the prolog, so something like

generates (without opts) code like this per compiler explorer

So it seems with the pure lazy $

If we do the lazy sync and can't or don't want to do wrappers, we can reduce the sync cost a bit further by only setting |
Do I understand correctly that we only need this for NativeAOT-LLVM? Maybe it would be OK to create a wrapper function with the https://manpages.debian.org/testing/binaryen/wasm-opt.1.en.html#one |
|
Here is how we create direct wasm with LLVM |
|
I don't think we need any fancy tooling to create these wrappers. All fcalls are implemented with macros that already have most of the information we need - it will need to be augmented with the ABI types (alternatively, you can play games with compile-time string concatenation using
// Usage
#define FCIMPL_VOID_I(foo, void *p)
// Rough definition
#define FCIMPL_VOID_I(funcname, a1) \
__asm(
.functype funcname (i32, i32) -> ()
.global funcname
funcname:
local.get 0
global.set __stack_pointer
local.get 1
call funcname#_native
end_function
);
void F_CALL_CONV funcname##_native(a1) { FCIMPL_PROLOG(funcname) |
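What the assembly wrapper in the rough macro does can also be modelled in plain C. This is only a sketch of the control flow: `stack_pointer_global`, `example_fcall`, `example_fcall_native`, and `managed_caller` are hypothetical names, the 16-byte frame is an assumption, and a C variable stands in for the wasm `__stack_pointer` global.

```c
#include <stdint.h>

/* Stands in for the wasm __stack_pointer global that native code uses. */
static uint32_t stack_pointer_global = 0x10000;

/* Hypothetical native FCall body. Like clang-generated code, it learns
   the stack position only from the global, never from an argument. */
static uint32_t example_fcall_native(uint32_t arg)
{
    return stack_pointer_global + arg;
}

/* Wrapper with the managed signature (sp, arg). It mirrors the sequence
   local.get 0 / global.set __stack_pointer / local.get 1 / call:
   publish the incoming SP, then forward the remaining argument. */
static uint32_t example_fcall(uint32_t sp, uint32_t arg)
{
    stack_pointer_global = sp;
    return example_fcall_native(arg);
}

/* Managed caller: carries SP as an argument and never touches the global. */
static uint32_t managed_caller(uint32_t sp)
{
    uint32_t frame = sp - 16u;        /* assumed 16-byte managed frame */
    return example_fcall(frame, 4u);  /* the global is synced only here */
}
```

The point of the model is that the global is written once per managed-to-native transition rather than in every managed prolog and epilog.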
|
Given the above, I propose that we go with the lazy

For the managed wrappers outlined above, do we need to be careful not to mess things up for the interpreter? |
It should just work. The interpreter will call FCalls like any other method with native code. |
Co-authored-by: Jan Kotas <[email protected]>
|
/ba-g markdown only change (build analysis seems to be confused) |